Notebook 08 – Solutions

DSST289: Introduction to Data Science

Published

September 21, 2024

Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries that we will be working with today.

I have set the options include=FALSE and message=FALSE to avoid cluttering the solutions with all the output from this code.
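Because that chunk is hidden, here is a minimal sketch of what it could look like. This is an assumption on my part rather than the actual course setup: it loads only the tidyverse, which supplies every function used in the code on this page (read_csv, filter, mutate, fct_inorder, ggplot, and so on).

```r
# Assumed setup chunk; the real one is hidden in the rendered solutions
# and may load additional packages or set plotting defaults.
library(tidyverse)
```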

College Majors

The data for today is taken from this paper by a senior economist at the Federal Reserve Board:

Douglas A. Webber, “Are College Costs Worth It? How Ability, Major, and Debt Affect the Returns to Schooling,” Economics of Education Review 53 (August 1, 2016): 296–310, https://doi.org/10.1016/j.econedurev.2016.04.007.

Webber uses information from the 2014-2018 American Community Surveys (a large dataset compiled by the US Census Bureau) to estimate the lifetime earnings of college graduates based on their undergraduate major. The dataset includes not just a single number for each major, but 99 data points that together describe the distribution of estimated lifetime earnings for that major. You can read in the dataset with the following:

majors <- read_csv("../data/majors.csv")
majors |>
  slice_sample(n = 10)

It contains three fields:

  • major the name of the major
  • percentile percentile of earnings; includes every value from 1-99
  • earnings estimated lifetime earnings in millions of dollars

I do not believe that the estimates account for inflation.

In this notebook we will start by creating several different plots that use this data to try to understand the relationship between college major and lifetime earnings. We will then take a step back and consider how we can apply the Data Feminism concepts from the notes to consider the implications and limits of our analysis.

Density Plots

To start understanding this dataset, we are going to make use of a new type of geometry called geom_density. It is a bit different than previous geometries we have used because rather than setting the y-aesthetic, we let the geometry calculate the y-values based on the distribution of the values set to the x aesthetic.

What is a density plot?

geom_density() estimates and visualizes the probability density function of a continuous variable. Here’s how it works:

Imagine you have a set of data points, like the heights of all the students in your class. If you wanted to see how these heights are distributed, you could create a histogram, which would count how many students fall into different height ranges (or “bins”). Those bins would cover ranges of heights (e.g., students between 4’0” and 4’11” in one group, 5’0” and 5’2” in another, etc.). The histogram would show you the frequency of each height interval.

However, histograms have some problems. Most obviously, they depend on the number of bins you choose, and they can look choppy or jagged.

This is where geom_density() comes in. Instead of counting data points in bins, geom_density() uses a method called kernel density estimation to create a smooth curve that estimates the probability density function of the data.

This image helps explain how to interpret density plots:

[Figure: geometric visualisation of the mode, median, and mean of an arbitrary probability density function, by CMG Lee.]

Mean, median, and mode in density plots

This general understanding is sufficient for our course. If you would like to learn a bit more about how this works, I recommend this video.
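If you would like to see the idea in code, here is a minimal sketch of kernel density estimation done by hand in base R. The heights are made up for illustration and are not part of the majors data. The sketch places a small normal "bump" on each observation, averages the bumps, and compares the result with R's built-in density() function, which (as far as I know) is what geom_density() relies on by default.

```r
# Made-up heights in inches; purely illustrative.
heights <- c(60, 62, 63, 65, 65, 66, 68, 70, 71, 74)

# The bandwidth controls how wide each bump is.
# bw.nrd0() is the default rule of thumb used by density().
bw <- bw.nrd0(heights)

# Grid of x values on which to evaluate the smooth curve.
grid <- seq(min(heights) - 3 * bw, max(heights) + 3 * bw, length.out = 512)

# KDE by hand: at each grid point, average the normal densities
# centered on every observation.
kde_manual <- sapply(grid, function(x) mean(dnorm(x, mean = heights, sd = bw)))

# The same estimate from the built-in function, on the same grid.
kde_builtin <- density(heights, bw = bw, from = min(grid), to = max(grid), n = 512)

# Largest gap between the two curves; density() uses a fast
# approximation, so this is small but not exactly zero.
max(abs(kde_manual - kde_builtin$y))
```

A larger bandwidth gives a smoother curve and a smaller one gives a bumpier curve, which plays the same role that the number of bins plays in a histogram.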

Making a density plot

In the code block below, take only the data points from the major “Psychology” and draw a ggplot graphic with earnings on the x-axis using a geom_density() layer.

majors |>
  filter(major == "Psychology") |>
  ggplot(aes(x = earnings)) +
  geom_density()

Shading

I find density plots easier to read if we shade the area underneath them with a color. In the code below adapt your first plot by setting the fill aesthetic to the fixed value “black” and the alpha aesthetic to the fixed value 0.3.

majors |>
  filter(major == "Psychology") |>
  ggplot(aes(earnings)) +
  geom_density(fill = "black", alpha = 0.3)

Multiple densities

Let’s compare the distribution of earnings from Psychology majors to two other majors. In the code below, take just the rows from the majors “Psychology”, “Music” and “Art History” and draw a plot with earnings on the x-axis using a geom_density layer. Set the fill aesthetic to the feature major and the alpha value to 0.3.

majors |>
  filter(major %in% c("Psychology", "Music", "Art History")) |>
  ggplot(aes(earnings)) +
  geom_density(aes(fill = major), alpha = 0.3)

How would you describe the differences and similarities between the estimated lifetime earnings of these three majors? Answer: Music majors have less variability in estimated lifetime earnings than either Psychology or Art History majors. Most Music majors have similar earnings to one another, and those earnings are lower than the earnings of most Psychology or Art History majors. Psychology and Art History majors appear to have similar modal earnings. However, Art History majors have higher variability in lifetime earnings than Psychology majors, with a long tail of higher earners.
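A visual impression of variability can also be checked numerically. The sketch below is one possible way to do that, using a tiny made-up table with the same three columns as majors (the values and the major names "A" and "B" are invented). It computes the gap between the 75th and 25th percentiles of earnings for each major, which is the interquartile range.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the majors data; all values are made up.
toy <- tibble(
  major = rep(c("A", "B"), each = 3),
  percentile = rep(c(25, 50, 75), times = 2),
  earnings = c(1.0, 1.2, 1.5, 0.9, 1.4, 2.3)
)

spread <- toy |>
  filter(percentile %in% c(25, 75)) |>
  # One row per major, with columns p25 and p75.
  pivot_wider(
    names_from = percentile,
    values_from = earnings,
    names_prefix = "p"
  ) |>
  # Interquartile range: a larger value means more spread-out earnings.
  mutate(iqr = p75 - p25) |>
  arrange(iqr)

spread
```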

Plotting Percentiles

Let’s now look at all of the majors in the data. In the plot below, take just the rows of the data corresponding to the 50th percentile of earnings (i.e., the median earnings within a major). Draw a scatterplot with earnings on the x-axis and major on the y-axis. Order the majors from the smallest to the largest earnings.

Hint

One way to approach this would be to combine mutate with fct_inorder to order the majors for plotting.

majors |>
  filter(percentile == 50) |>
  arrange(earnings) |>
  mutate(major = fct_inorder(major)) |>
  ggplot(aes(earnings, major)) +
  geom_point()
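To see why the arrange step has to come before fct_inorder, here is a small sketch with three major names typed in by hand. A plain factor sorts its levels alphabetically, while fct_inorder keeps the levels in the order in which the values first appear, which is the order ggplot then uses on the axis.

```r
library(forcats)

# Pretend these are already sorted by earnings.
x <- c("Music", "Psychology", "Art History")

# Default factor: levels are sorted alphabetically.
levels(factor(x))

# fct_inorder: levels follow the order of first appearance.
levels(fct_inorder(x))
```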

Comparing Percentiles

The point of the data we have today is to go beyond a single number in order to get a different picture of how major influences lifetime earnings. In the plot below, create a new version of the previous plot, but include two dots for each major, one for the 25th percentile of earnings and another for the 75th percentile of earnings. Color the 25th percentile in "olivedrab" and the 75th percentile "navy". Order the majors based on earnings.

Note: There are at least two different ways to do this task. Either is fine, but try to do it without creating a second dataset, by using mutate and if_else to construct a new color variable.

majors |>
  filter(percentile %in% c(25, 75)) |>
  arrange(desc(earnings)) |>
  mutate(major = fct_inorder(major)) |>
  mutate(color = if_else(percentile == 25, "olivedrab", "navy")) |>
  ggplot(aes(earnings, major)) +
  geom_point(aes(color = color)) +
  scale_color_identity()
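Another possible variant, sketched here on a small made-up table rather than the real data, skips the hand-built color column. It maps the percentile itself to the color aesthetic (as a factor) and assigns the two colors with scale_color_manual. A side effect is that the plot gets a legend, which scale_color_identity does not draw by default.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy stand-in for the majors data; all values are made up.
toy <- tibble(
  major = rep(c("A", "B", "C"), each = 2),
  percentile = rep(c(25, 75), times = 3),
  earnings = c(1.0, 1.8, 1.4, 2.6, 1.2, 2.0)
)

p <- toy |>
  arrange(desc(earnings)) |>
  mutate(major = fct_inorder(major)) |>
  ggplot(aes(earnings, major)) +
  # Map the percentile (as a discrete value) to color.
  geom_point(aes(color = factor(percentile))) +
  # Assign a fixed color to each percentile value.
  scale_color_manual(
    values = c("25" = "olivedrab", "75" = "navy"),
    name = "Percentile"
  )

p
```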

Repeat the previous question but produce a plot showing the 10th and 90th percentiles of each major.

majors |>
  filter(percentile %in% c(10, 90)) |>
  arrange(desc(earnings)) |>
  mutate(major = fct_inorder(major)) |>
  mutate(color = if_else(percentile == 10, "olivedrab", "navy")) |>
  ggplot(aes(earnings, major)) +
  geom_point(aes(color = color)) +
  scale_color_identity()

Comparing Percentages to Major

The 10th percentile of earnings for Chemical Engineers is $2.63 million USD over their lifetime. In the plot below, draw a scatter plot that shows the percentage of people in each major that are expected to make at least 2.63 million dollars over the course of their life. Put major on the y-axis and the percentage on the x-axis and order by the percentage.

majors |>
  filter(earnings >= 2.63) |>
  arrange(percentile) |>
  group_by(major) |>
  slice(1) |>
  ungroup() |>
  arrange(desc(percentile)) |>
  mutate(major = fct_inorder(major)) |>
  mutate(percentage = 100 - percentile) |>
  ggplot(aes(percentage, major)) +
  geom_point()
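The logic behind 100 - percentile is easier to follow on a small example. The sketch below uses three invented majors ("X", "Y", "Z") with made-up earnings. It finds, for each major, the lowest percentile whose earnings reach the threshold and subtracts it from 100. It uses summarize with min as an equivalent shortcut for the arrange, group_by, and slice steps in the solution above.

```r
library(dplyr)

# Toy data: 99 percentiles for each of three invented majors.
# Earnings rise steadily with the percentile; the values are made up.
toy <- tibble(
  major = rep(c("X", "Y", "Z"), each = 99),
  percentile = rep(1:99, times = 3),
  earnings = percentile / rep(c(50, 25, 30), each = 99)
)

shares <- toy |>
  # Keep only the percentiles at or above the threshold.
  filter(earnings >= 2.63) |>
  group_by(major) |>
  # The lowest such percentile marks where the threshold is first reached;
  # everyone above it (roughly 100 minus that percentile) earns at least 2.63.
  summarize(percentage = 100 - min(percentile))

shares
```

Major "X" never reaches the threshold, so it has no rows left after the filter and is missing from the result. The same thing happens in the solution above: majors where even the 99th percentile earns less than 2.63 million dollars do not appear in the plot.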

Comparing Median Proportions

In the final plot, let’s compare the actual differences in the median earnings of each major. Filter to only include the 50th percentile of earnings and create a new variable earnings_prop that measures the earnings of each major divided by the median earnings of all majors. Create a plot similar to the others above where the majors are ordered from the lowest earnings to the highest earnings with the proportion of median earnings on the x-axis. As this is our final graph for today, add labs to the plot that describe its axes.

majors |>
  filter(percentile == 50) |>
  mutate(earnings_prop = earnings / median(earnings)) |>
  arrange(earnings_prop) |>
  mutate(major = fct_inorder(major)) |>
  ggplot(aes(earnings_prop, major)) +
  geom_point() +
  labs(
    x = "Proportion of median earnings (all majors)",
    y = "Undergraduate major",
    title = "Proportion of median earnings of all majors by major"
  )

The shape won’t be much different than the previous plots, but the scale of the x-axis is (to me) more interpretable than the abstract idea of “lifetime earnings.”

Reflections

Data Feminism says that we should consider the values and power structures that are baked into the data and data analyses that we do. What values seem to be expressed in the data here regarding college majors and life goals?

Answer: The way this data is organized has two closely related values implicitly coded into it. First, it implies that college education is primarily pre-professional, with the main goal of training future participants in the work force. Second, it builds in the assumption that lifetime earnings is an important metric for judging what it means to be well educated.

Data Feminism also says that data is never neutral or non-political. The data here was an attempt to add nuance to studies focused only on a single summary showing the relationship between college majors and earnings. Can you think of another quantitative dataset that would help augment the story told by the data we looked at above?

Answer: The data here gives a wide view of a lot of data (it’s a 1 percent sample of all US households). It would be nice to augment it with some more specific datasets, such as the earnings data of all college graduates from a single university. It would also be good to have another version of the data that is augmented with additional metadata, such as whether graduates have an additional advanced degree or a second major, or that controls for the cost of living. Analyses from all these different approaches would help build a more complete picture.

Can you think of a qualitative source of information that would help augment the story told by the quantitative data?

Answer: It would be good to include qualitative information such as ethnographic interviews and close analyses of the careers of graduates with different majors. This would also help us get outside of the idea that the only measurement of success is the amount of money that someone makes.

It should be clear from these plots that while there is a relationship between college majors and lifetime earnings, the situation is a bit more complicated than the story told with a single number (say, the median earnings by major). Can you construct a story that would use some of the results here to encourage someone to major in a field that interests them rather than overly optimizing for the majors that have the highest typical salaries?

Answer: While some majors do make more money than others on average, there is a considerable range within the majors. Particularly if we ignore the small, specialized engineering majors, there is considerable overlap in the distribution of incomes. It seems that it would be better to be a well-paid geography major than a typical accounting major. So, if you love and are good at geography, you should follow that path even if you are primarily motivated by money.

It is also possible to turn the data visualizations here into a fairly negative story about lifetime earnings. Try to describe in the most pessimistic way possible how you might summarize the data here. The goal of this exercise is to show how data analysis is never neutral and the same results can be interpreted very differently.

Answer: The data can also be seen as evidence that we live in a deeply unfair economic system. There are a few technical majors (mostly engineers) that make the highest salaries. These majors are only available at specialized schools that are only accessible to students coming from rich high school systems with strong math and science programs. These majors are also typically heavily dominated by male students. There is a wide range of salaries in other majors, which we cannot say much more about definitively without more data, but it is quite reasonable to assume that the upper percentiles are dominated by White students from affluent backgrounds at top schools.

Note that my answers to the questions above are just some of the possible responses and are certainly not exhaustive of all the ideas and reflections that you might have about the data.